Monitoring a Kubernetes Platform on a GPU Cluster

This guide uses gpu-monitoring-tools, together with the Prometheus Operator and kube-prometheus, to monitor a Kubernetes-based machine learning platform running on a cluster of NVIDIA GPU nodes.

Monitoring Options

  1. gpu-monitoring-tools (hereafter "gmt") offers several metrics-collection options:
    1. NVML Go bindings (C API).
    2. DCGM exporter (Prometheus metrics on top of DCGM).
  2. gmt also offers several monitoring-framework options:
    1. The Prometheus DaemonSet built directly on the DCGM exporter, which provides collection and monitoring only.
    2. Prometheus Operator + kube-prometheus (as modified by NVIDIA), which includes the full set of collection, monitoring, alerting, and dashboard components.

We adopt the second framework option; it also works on CPU-only machines without GPUs.
In our tests, this setup simultaneously monitors host hardware (CPU, GPU, memory, disk, etc.), core Kubernetes components (apiserver, controller-manager, scheduler, etc.), and the business services running on Kubernetes.

What Is an Operator

  1. For stateless applications, native Kubernetes resources such as Deployment already handle autoscaling, automatic restarts, and upgrades well.
  2. Stateful applications such as databases, caches, and monitoring systems need operational procedures that are specific to each application.
  3. An Operator packages those application-specific operations into software and extends the Kubernetes API through custom (third-party) resources, letting users create, configure, and manage the application; it usually ships as a collection of Kubernetes CRDs plus a controller.
  4. Analogous to the Controller/Resource relationship in Kubernetes, an Operator takes the requests users submit to its controller and keeps the actual number and state of instances in line with what the user expects, while encapsulating the detailed operational steps. A minimal CRD sketch follows this list.
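
To make point 3 concrete: the Prometheus Operator used here extends the API with CRDs such as Prometheus, Alertmanager, and ServiceMonitor. Below is a minimal ServiceMonitor sketch, not taken from the charts: the example-app names, the http-metrics port, and the prometheus: kube-prometheus label are placeholders, and the label your Prometheus resource actually selects on is whatever its serviceMonitorSelector specifies.

apiVersion: monitoring.coreos.com/v1
kind: ServiceMonitor
metadata:
  name: example-app              # placeholder name
  namespace: monitoring
  labels:
    prometheus: kube-prometheus  # must match the Prometheus resource's serviceMonitorSelector
spec:
  selector:
    matchLabels:
      app: example-app           # scrape Services carrying this label
  namespaceSelector:
    matchNames:
    - default
  endpoints:
  - port: http-metrics           # named port on the target Service
    interval: 30s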

Prerequisites

Images

Load the following images onto every node in the cluster:

# If you build Prometheus on vanilla Kubernetes, create the resources with these two images; you only get metrics, with no monitoring/alerting integration
# docker pull nvidia/dcgm-exporter:1.4.6
# docker pull quay.io/prometheus/node-exporter:v0.16.0

# operator base images
docker pull quay.io/coreos/prometheus-operator:v0.17.0
docker pull quay.io/coreos/hyperkube:v1.7.6_coreos.0

# exporters
docker pull nvidia/dcgm-exporter:1.4.3
docker pull quay.io/prometheus/node-exporter:v0.15.2

# prometheus components
docker pull quay.io/coreos/configmap-reload:v0.0.1
docker pull quay.io/coreos/prometheus-config-reloader:v0.0.3
docker pull gcr.io/google_containers/addon-resizer:1.7
docker pull gcr.io/google_containers/kube-state-metrics:v1.2.0
docker pull quay.io/coreos/grafana-watcher:v0.0.8
docker pull grafana/grafana:5.0.0
docker pull quay.io/prometheus/prometheus:v2.2.1

Helm charts

Download and unpack the following Helm charts:

wget https://nvidia.github.io/gpu-monitoring-tools/helm-charts/kube-prometheus-0.0.43.tgz
tar zxvf kube-prometheus-0.0.43.tgz
wget https://nvidia.github.io/gpu-monitoring-tools/helm-charts/prometheus-operator-0.0.15.tgz
tar zxvf prometheus-operator-0.0.15.tgz

Installation

1. Configuration
Node labels

Label each GPU node that should be monitored (a sketch of how this label is consumed follows the command).

kubectl label no <nodename> hardware-type=NVIDIAGPU
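
In the NVIDIA charts, this label is what the GPU-related exporters key on to land on the right machines. As a rough illustration only (not the chart's literal manifest), a DaemonSet pinned to the labelled nodes looks like this:

# Illustrative sketch: a DaemonSet restricted to nodes labelled hardware-type=NVIDIAGPU
apiVersion: apps/v1
kind: DaemonSet
metadata:
  name: dcgm-exporter
  namespace: monitoring
spec:
  selector:
    matchLabels:
      app: dcgm-exporter
  template:
    metadata:
      labels:
        app: dcgm-exporter
    spec:
      nodeSelector:
        hardware-type: NVIDIAGPU   # only schedules onto the labelled GPU nodes
      containers:
      - name: dcgm-exporter
        image: nvidia/dcgm-exporter:1.4.3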

External etcd

If etcd is external — that is, it is not started as containers when the Kubernetes cluster is initialized, but runs as a pre-existing cluster outside Kubernetes — you must specify the etcd cluster addresses (a sketch of the scrape target this produces follows the snippet).
Assume the external etcd members are reachable at etcd0, etcd1, and etcd2 on port 2379 over plain HTTP.

vim kube-prometheus/charts/exporter-kube-etcd/values.yaml

#etcdPort:  4001
etcdPort: 2379

#endpoints: []
endpoints: [etcd0,etcd1,etcd2]
scheme: http
...
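
Behind these values, the exporter-kube-etcd chart essentially hands Prometheus a scrape target for each listed address; conceptually, the result is equivalent to a headless Service backed by static Endpoints, roughly as sketched below (the object name is illustrative, and etcd0/etcd1/etcd2 stand in for the real member IPs):

apiVersion: v1
kind: Endpoints
metadata:
  name: kube-prometheus-exporter-kube-etcd   # illustrative name
  namespace: kube-system
subsets:
- addresses:
  - ip: etcd0   # the member addresses configured in values.yaml
  - ip: etcd1
  - ip: etcd2
  ports:
  - name: http-metrics
    port: 2379
    protocol: TCP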

You also need to add a panel for the external etcd to the Grafana dashboard. Note:

  1. Add a "," at line 465 of the dashboard JSON.
  2. Line 465 sits right after the panel whose title is "Crashlooping Control Plane Pods".
  3. Insert the following content at line 465, keeping the indentation consistent.
    vim kube-prometheus/charts/grafana/dashboards/kubernetes-cluster-status-dashboard.json
{
"cacheTimeout": null,
"colorBackground": false,
"colorValue": false,
"colors": [
"rgba(245, 54, 54, 0.9)",
"rgba(237, 129, 40, 0.89)",
"rgba(50, 172, 45, 0.97)"
],
"datasource": "prometheus",
"editable": true,
"format": "percent",
"gauge": {
"maxValue": 100,
"minValue": 0,
"show": true,
"thresholdLabels": false,
"thresholdMarkers": true
},
"gridPos": {
"h": 5,
"w": 6,
"x": 0,
"y": 11
},
"id": 14,
"interval": null,
"links": [],
"mappingType": 1,
"mappingTypes": [
{
"name": "value to text",
"value": 1
},
{
"name": "range to text",
"value": 2
}
],
"maxDataPoints": 100,
"nullPointMode": "connected",
"nullText": null,
"postfix": "",
"postfixFontSize": "50%",
"prefix": "",
"prefixFontSize": "50%",
"rangeMaps": [
{
"from": "null",
"text": "N/A",
"to": "null"
}
],
"sparkline": {
"fillColor": "rgba(31, 118, 189, 0.18)",
"full": false,
"lineColor": "rgb(31, 120, 193)",
"show": false
},
"tableColumn": "",
"targets": [
{
"expr": "(sum(up{job=\"kube-etcd\"} == 1) / count(up{job=\"kube-etcd\"})) * 100",
"format": "time_series",
"intervalFactor": 2,
"refId": "A",
"step": 600
}
],
"thresholds": "50, 80",
"title": "External etcd Servers UP",
"type": "singlestat",
"valueFontSize": "80%",
"valueMaps": [
{
"op": "=",
"text": "N/A",
"value": "null"
}
],
"valueName": "current"
}
Exposing ports

Expose the Prometheus, Alertmanager, and Grafana ports for troubleshooting. These ports must be directly reachable from the development VPC.

vim kube-prometheus/values.yaml

alertmanager:
  ...
  service:
    ...
    nodePort: 30779
    type: NodePort
prometheus:
  ...
  service:
    ...
    nodePort: 30778
    type: NodePort
vim kube-prometheus/charts/grafana/values.yaml
service:
  nodePort: 30780
  type: NodePort
Alert receiver

Configure the alert receiver. We usually point it at the ControlCenter Service in the same cluster, which reformats the alerts and forwards them to IMS; the webhook payload ControlCenter receives is sketched after the configuration below.

vim kube-prometheus/values.yaml

alertmanager:
  config:
    route:
      receiver: 'webhook_test'
      routes:
      - match:
          alertname: DeadMansSwitch
        receiver: 'webhook_test'
      - match:
          severity: critical
        receiver: 'webhook_test'
      - match:
          severity: warning
        receiver: 'webhook_test'
    receivers:
    - name: 'webhook_test'
      webhook_configs:
      - send_resolved: true
        # short for controlcenter.default.svc or controlcenter.default.svc.cluster.local
        url: "http://controlcenter.default:7777/api/alerts"
Alert rules

Platform monitoring covers node hardware (CPU, memory, disk, network, GPU), Kubernetes components (kube-controller-manager, kube-scheduler, kubelet, API server), and Kubernetes workloads (Deployment, StatefulSet, Pod).
Because they are lengthy, the monitoring and alerting rules are listed in the appendix.

2. Startup
cd prometheus-operator
helm install . --name prometheus-operator --namespace monitoring
cd kube-prometheus
helm install . --name kube-prometheus --namespace monitoring
3. Cleanup
helm delete --purge kube-prometheus
helm delete --purge prometheus-operator

Common Issues

Kubelet metrics are not exposed

This issue does not occur on Kubernetes 1.13.0.

  1. For versions before 1.13.0, change the kubelet metrics scrape scheme from https to http; otherwise the kubelet targets in Prometheus will be down. [GitHub issue 926]
    vim kube-prometheus/charts/exporter-kubelets/templates/servicemonitor.yaml
spec:
  endpoints:
#  - port: https-metrics
#    scheme: https
  - port: http-metrics
    scheme: http
    ...
#  - port: https-metrics
#    scheme: https
  - port: http-metrics
    scheme: http
    path: /metrics/cadvisor
  ...
  2. Verification
    The kubelet target is visible on the Prometheus page.
controller-manager and scheduler metrics are not exposed
Method 1

For Kubernetes v1.13.0.

  1. Add the following to kubeadm.conf and pass it at initialization time with kubeadm init --config kubeadm.conf.

    apiVersion: kubeadm.k8s.io/v1alpha3
    kind: ClusterConfiguration
    kubernetesVersion: 1.13.0
    networking:
      podSubnet: 10.244.0.0/16
    controllerManagerExtraArgs:
      address: 0.0.0.0
    schedulerExtraArgs:
      address: 0.0.0.0
    ...
  2. Label the pods.

    kubectl get po -n kube-system
    kubectl -n kube-system label po kube-controller-manager-<nodename> k8s-app=kube-controller-manager
    kubectl -n kube-system label po kube-scheduler-<nodename> k8s-app=kube-scheduler
    kubectl get po -n kube-system --show-labels
  3. Verification
    The kube-controller-manager and kube-scheduler targets are visible on the Prometheus page.
    The controller-manager and scheduler status panels are visible in Grafana.

Method 2

For Kubernetes versions before 1.13.0.

  1. Modify the kubeadm core configuration.
    kubeadm config view

Save the output above as newConfig.yaml and add the following settings:

controllerManagerExtraArgs:
  address: 0.0.0.0
schedulerExtraArgs:
  address: 0.0.0.0

Apply the new configuration:

kubeadm config upload from-file --config newConfig.yaml

  2. Label the pods.

    kubectl get po -n kube-system
    kubectl -n kube-system label po kube-controller-manager-<nodename> k8s-app=kube-controller-manager
    kubectl -n kube-system label po kube-scheduler-<nodename> k8s-app=kube-scheduler
    kubectl get po -n kube-system --show-labels
  3. Recreate the exporter Services.

    kubectl -n kube-system get svc

You will see the following two Services, which have no CLUSTER-IP:

kube-prometheus-exporter-kube-controller-manager
kube-prometheus-exporter-kube-scheduler

kubectl -n kube-system get svc kube-prometheus-exporter-kube-controller-manager -o yaml
kubectl -n kube-system get svc kube-prometheus-exporter-kube-scheduler -o yaml

Save the output as newKubeControllerManagerSvc.yaml and newKubeSchedulerSvc.yaml respectively, remove the non-essential fields (uid, selfLink, resourceVersion, creationTimestamp, and so on), and recreate the Services, as sketched below.
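
After stripping those fields, the controller-manager Service is left with roughly the shape below. This is a sketch for orientation only; keep the content of your own kubectl output rather than copying this (the scheduler Service is analogous, on port 10251).

apiVersion: v1
kind: Service
metadata:
  name: kube-prometheus-exporter-kube-controller-manager
  namespace: kube-system
  labels:
    k8s-app: kube-controller-manager
spec:
  clusterIP: None                  # headless, matching the "no CLUSTER-IP" output above
  ports:
  - name: http-metrics
    port: 10252
    protocol: TCP
    targetPort: 10252
  selector:
    k8s-app: kube-controller-manager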

kubectl delete -n kube-system svc kube-prometheus-exporter-kube-controller-manager kube-prometheus-exporter-kube-scheduler
kubectl apply -f newKubeControllerManagerSvc.yaml
kubectl apply -f newKubeSchedulerSvc.yaml

  4. Make sure the Prometheus pods can reach ports 10251/10252 on the nodes running kube-controller-manager and kube-scheduler.

  5. Verification is the same as for Method 1.

CoreDNS metrics are not exposed

In Kubernetes v1.13.0 the default cluster DNS component is CoreDNS, so the kube-prometheus configuration must be adjusted before the DNS service can be monitored.

Method 1
  1. Change the selectorLabel value in the chart so that it matches the CoreDNS pod label.
    kubectl -n kube-system get po --show-labels | grep coredns
    # output
    coredns k8s-app=kube-dns
vim kube-prometheus/charts/exporter-coredns/values.yaml
#selectorLabel: coredns
selectorLabel: kube-dns
  2. Reinstall kube-prometheus.

    helm delete --purge kube-prometheus
    helm install --name kube-prometheus --namespace monitoring kube-prometheus
  3. Verification
    The kube-dns target is visible in Prometheus.

Method 2
  1. Change the pod label so that it matches the chart configuration (the chart's default selectorLabel is coredns), for example:

    kubectl -n kube-system label po <coredns-podname> k8s-app=coredns --overwrite
  2. Verification is the same as for Method 1.

Appendix: Platform Monitoring and Alerting Rules


vim charts/exporter-kube-controller-manager/templates/kube-controller-manager.rules.yaml

{{ define "kube-controller-manager.rules.yaml.tpl" }}
groups:
- name: kube-controller-manager.rules
rules:
- alert: K8SControllerManagerDown
expr: absent(up{job="kube-controller-manager"} == 1)
for: 5m
labels:
severity: critical
annotations:
description: There is no running K8S controller manager. Deployments and replication
controllers are not making progress.
runbook: https://coreos.com/tectonic/docs/latest/troubleshooting/controller-recovery.html#recovering-a-controller-manager
summary: Controller manager is down
{{ end }}
vim charts/exporter-kube-etcd/templates/etcd3.rules.yaml
{{ define "etcd3.rules.yaml.tpl" }}
groups:
- name: ./etcd3.rules
rules:
- alert: InsufficientMembers
expr: count(up{job="etcd"} == 0) > (count(up{job="etcd"}) / 2 - 1)
for: 3m
labels:
severity: major
annotations:
description: If one more etcd member goes down the cluster will be unavailable
summary: etcd cluster insufficient members
- alert: NoLeader
expr: etcd_server_has_leader{job="etcd"} == 0
for: 1m
labels:
severity: critical
annotations:
description: etcd member {{`{{ $labels.instance }}`}} has no leader
summary: etcd member has no leader
- alert: HighNumberOfLeaderChanges
expr: increase(etcd_server_leader_changes_seen_total{job="etcd"}[1h]) > 3
labels:
severity: warning
annotations:
description: etcd instance {{`{{ $labels.instance }}`}} has seen {{`{{ $value }}`}} leader
changes within the last hour
summary: a high number of leader changes within the etcd cluster are happening
- alert: HighNumberOfFailedGRPCRequests
expr: sum(rate(grpc_server_handled_total{grpc_code!="OK",job="etcd"}[5m])) BY (grpc_service, grpc_method)
/ sum(rate(grpc_server_handled_total{job="etcd"}[5m])) BY (grpc_service, grpc_method) > 0.01
for: 10m
labels:
severity: warning
annotations:
description: '{{`{{ $value }}`}}% of requests for {{`{{ $labels.grpc_method }}`}} failed
on etcd instance {{`{{ $labels.instance }}`}}'

summary: a high number of gRPC requests are failing
- alert: HighNumberOfFailedGRPCRequests
expr: sum(rate(grpc_server_handled_total{grpc_code!="OK",job="etcd"}[5m])) BY (grpc_service, grpc_method)
/ sum(rate(grpc_server_handled_total{job="etcd"}[5m])) BY (grpc_service, grpc_method) > 0.05
for: 5m
labels:
severity: minor
annotations:
description: '{{`{{ $value }}`}}% of requests for {{`{{ $labels.grpc_method }}`}} failed
on etcd instance {{`{{ $labels.instance }}`}}'

summary: a high number of gRPC requests are failing
- alert: GRPCRequestsSlow
expr: histogram_quantile(0.99, sum(rate(grpc_server_handling_seconds_bucket{job="etcd",grpc_type="unary"}[5m])) by (grpc_service, grpc_method, le))
> 0.15
for: 10m
labels:
severity: minor
annotations:
description: on etcd instance {{`{{ $labels.instance }}`}} gRPC requests to {{`{{ $labels.grpc_method
}}`}} are slow
summary: slow gRPC requests
- alert: HighNumberOfFailedHTTPRequests
expr: sum(rate(etcd_http_failed_total{job="etcd"}[5m])) BY (method) / sum(rate(etcd_http_received_total{job="etcd"}[5m]))
BY (method) > 0.01
for: 10m
labels:
severity: warning
annotations:
description: '{{`{{ $value }}`}}% of requests for {{`{{ $labels.method }}`}} failed on etcd
instance {{`{{ $labels.instance }}`}}'

summary: a high number of HTTP requests are failing
- alert: HighNumberOfFailedHTTPRequests
expr: sum(rate(etcd_http_failed_total{job="etcd"}[5m])) BY (method) / sum(rate(etcd_http_received_total{job="etcd"}[5m]))
BY (method) > 0.05
for: 5m
labels:
severity: minor
annotations:
description: '{{`{{ $value }}`}}% of requests for {{`{{ $labels.method }}`}} failed on etcd
instance {{`{{ $labels.instance }}`}}'

summary: a high number of HTTP requests are failing
- alert: HTTPRequestsSlow
expr: histogram_quantile(0.99, rate(etcd_http_successful_duration_seconds_bucket[5m]))
> 0.15
for: 10m
labels:
severity: minor
annotations:
description: on etcd instance {{`{{ $labels.instance }}`}} HTTP requests to {{`{{ $labels.method
}}`}} are slow
summary: slow HTTP requests
- alert: EtcdMemberCommunicationSlow
expr: histogram_quantile(0.99, rate(etcd_network_peer_round_trip_time_seconds_bucket[5m]))
> 0.15
for: 10m
labels:
severity: warning
annotations:
description: etcd instance {{`{{ $labels.instance }}`}} member communication with
{{`{{ $labels.To }}`}} is slow
summary: etcd member communication is slow
- alert: HighNumberOfFailedProposals
expr: increase(etcd_server_proposals_failed_total{job="etcd"}[1h]) > 5
labels:
severity: warning
annotations:
description: etcd instance {{`{{ $labels.instance }}`}} has seen {{`{{ $value }}`}} proposal
failures within the last hour
summary: a high number of proposals within the etcd cluster are failing
- alert: HighFsyncDurations
expr: histogram_quantile(0.99, rate(etcd_disk_wal_fsync_duration_seconds_bucket[5m]))
> 0.5
for: 10m
labels:
severity: warning
annotations:
description: etcd instance {{`{{ $labels.instance }}`}} fsync durations are high
summary: high fsync durations
- alert: HighCommitDurations
expr: histogram_quantile(0.99, rate(etcd_disk_backend_commit_duration_seconds_bucket[5m]))
> 0.25
for: 10m
labels:
severity: warning
annotations:
description: etcd instance {{`{{ $labels.instance }}`}} commit durations are high
summary: high commit durations
{{ end }}
vim charts/exporter-kubelets/templates/kubelet.rules.yaml
{{ define "kubelet.rules.yaml.tpl" }}
groups:
- name: kubelet.rules
rules:
- alert: K8SNodeNotReady
expr: kube_node_status_condition{condition="Ready",status="true"} == 0
for: 1h
labels:
severity: warning
annotations:
description: The Kubelet on {{`{{ $labels.node }}`}} has not checked in with the API,
or has set itself to NotReady, for more than an hour
summary: Node status is NotReady
- alert: K8SManyNodesNotReady
expr: count(kube_node_status_condition{condition="Ready",status="true"} == 0)
> 1 and (count(kube_node_status_condition{condition="Ready",status="true"} ==
0) / count(kube_node_status_condition{condition="Ready",status="true"})) > 0.2
for: 1m
labels:
severity: minor
annotations:
description: '{{`{{ $value }}`}}% of Kubernetes nodes are not ready'
- alert: K8SKubeletDown
expr: count(up{job="kubelet"} == 0) / count(up{job="kubelet"}) * 100 > 10
for: 1h
labels:
severity: minor
annotations:
description: Prometheus failed to scrape {{`{{ $value }}`}}% of kubelets.
summary: Prometheus failed to scrape
- alert: K8SManyKubeletDown
expr: (absent(up{job="kubelet"} == 1) or count(up{job="kubelet"} == 0) / count(up{job="kubelet"}))
* 100 > 30
for: 1h
labels:
severity: major
annotations:
description: Prometheus failed to scrape {{`{{ $value }}`}}% of kubelets, or all Kubelets
have disappeared from service discovery.
summary: Many Kubelets cannot be scraped
- alert: K8SKubeletTooManyPods
expr: kubelet_running_pod_count > 100
for: 10m
labels:
severity: warning
annotations:
description: Kubelet {{`{{$labels.instance}}`}} is running {{`{{$value}}`}} pods, close
to the limit of 110
summary: Kubelet is close to pod limit
{{ end }}
vim charts/exporter-kubernetes/templates/kubernetes.rules.yaml
{{ define "kubernetes.rules.yaml.tpl" }}
groups:
- name: kubernetes.rules
rules:
- record: pod_name:container_memory_usage_bytes:sum
expr: sum(container_memory_usage_bytes{container_name!="POD",pod_name!=""}) BY
(pod_name)
- record: pod_name:container_spec_cpu_shares:sum
expr: sum(container_spec_cpu_shares{container_name!="POD",pod_name!=""}) BY (pod_name)
- record: pod_name:container_cpu_usage:sum
expr: sum(rate(container_cpu_usage_seconds_total{container_name!="POD",pod_name!=""}[5m]))
BY (pod_name)
- record: pod_name:container_fs_usage_bytes:sum
expr: sum(container_fs_usage_bytes{container_name!="POD",pod_name!=""}) BY (pod_name)
- record: namespace:container_memory_usage_bytes:sum
expr: sum(container_memory_usage_bytes{container_name!=""}) BY (namespace)
- record: namespace:container_spec_cpu_shares:sum
expr: sum(container_spec_cpu_shares{container_name!=""}) BY (namespace)
- record: namespace:container_cpu_usage:sum
expr: sum(rate(container_cpu_usage_seconds_total{container_name!="POD"}[5m]))
BY (namespace)
- record: cluster:memory_usage:ratio
expr: sum(container_memory_usage_bytes{container_name!="POD",pod_name!=""}) BY
(cluster) / sum(machine_memory_bytes) BY (cluster)
- record: cluster:container_spec_cpu_shares:ratio
expr: sum(container_spec_cpu_shares{container_name!="POD",pod_name!=""}) / 1000
/ sum(machine_cpu_cores)
- record: cluster:container_cpu_usage:ratio
expr: sum(rate(container_cpu_usage_seconds_total{container_name!="POD",pod_name!=""}[5m]))
/ sum(machine_cpu_cores)
- record: apiserver_latency_seconds:quantile
expr: histogram_quantile(0.99, rate(apiserver_request_latencies_bucket[5m])) /
1e+06
labels:
quantile: "0.99"
- record: apiserver_latency:quantile_seconds
expr: histogram_quantile(0.9, rate(apiserver_request_latencies_bucket[5m])) /
1e+06
labels:
quantile: "0.9"
- record: apiserver_latency_seconds:quantile
expr: histogram_quantile(0.5, rate(apiserver_request_latencies_bucket[5m])) /
1e+06
labels:
quantile: "0.5"
- alert: APIServerLatencyHigh
expr: apiserver_latency_seconds:quantile{quantile="0.99",subresource!="log",verb!~"^(?:WATCH|WATCHLIST|PROXY|CONNECT)$"}
> 1
for: 10m
labels:
severity: warning
annotations:
description: the API server has a 99th percentile latency of {{`{{ $value }}`}} seconds
for {{`{{$labels.verb}}`}} {{`{{$labels.resource}}`}}
summary: API server high latency
- alert: APIServerLatencyHigh
expr: apiserver_latency_seconds:quantile{quantile="0.99",subresource!="log",verb!~"^(?:WATCH|WATCHLIST|PROXY|CONNECT)$"}
> 4
for: 10m
labels:
severity: minor
annotations:
description: the API server has a 99th percentile latency of {{`{{ $value }}`}} seconds
for {{`{{$labels.verb}}`}} {{`{{$labels.resource}}`}}
summary: API server high latency
- alert: APIServerErrorsHigh
expr: rate(apiserver_request_count{code=~"^(?:5..)$"}[5m]) / rate(apiserver_request_count[5m])
* 100 > 2
for: 10m
labels:
severity: warning
annotations:
description: API server returns errors for {{`{{ $value }}`}}% of requests
summary: API server request errors
- alert: APIServerErrorsVeryHigh
expr: rate(apiserver_request_count{code=~"^(?:5..)$"}[5m]) / rate(apiserver_request_count[5m])
* 100 > 5
for: 10m
labels:
severity: minor
annotations:
description: API server returns errors for {{`{{ $value }}`}}% of requests
summary: API server request very high error rate
- alert: K8SApiserverDown
expr: absent(up{job="apiserver"} == 1)
for: 20m
labels:
severity: critical
annotations:
description: No API servers are reachable or all have disappeared from service
discovery
summary: No API servers are reachable
- alert: K8sCertificateExpirationNotice
labels:
severity: minor
annotations:
description: Kubernetes API Certificate is expiring soon (less than 7 days)
summary: Kubernetes API Certificate is expiring soon
expr: sum(apiserver_client_certificate_expiration_seconds_bucket{le="604800"}) > 0
- alert: K8sCertificateExpirationNotice
labels:
severity: major
annotations:
description: Kubernetes API Certificate is expiring in less than 1 day
summary: Kubernetes API Certificate is expiring
expr: sum(apiserver_client_certificate_expiration_seconds_bucket{le="86400"}) > 0
{{ end }}
vim charts/exporter-kube-scheduler/templates/kube-scheduler.rules.yaml
{{ define "kube-scheduler.rules.yaml.tpl" }}
groups:
- name: kube-scheduler.rules
rules:
- record: cluster:scheduler_e2e_scheduling_latency_seconds:quantile
expr: histogram_quantile(0.99, sum(scheduler_e2e_scheduling_latency_microseconds_bucket)
BY (le, cluster)) / 1e+06
labels:
quantile: "0.99"
- record: cluster:scheduler_e2e_scheduling_latency_seconds:quantile
expr: histogram_quantile(0.9, sum(scheduler_e2e_scheduling_latency_microseconds_bucket)
BY (le, cluster)) / 1e+06
labels:
quantile: "0.9"
- record: cluster:scheduler_e2e_scheduling_latency_seconds:quantile
expr: histogram_quantile(0.5, sum(scheduler_e2e_scheduling_latency_microseconds_bucket)
BY (le, cluster)) / 1e+06
labels:
quantile: "0.5"
- record: cluster:scheduler_scheduling_algorithm_latency_seconds:quantile
expr: histogram_quantile(0.99, sum(scheduler_scheduling_algorithm_latency_microseconds_bucket)
BY (le, cluster)) / 1e+06
labels:
quantile: "0.99"
- record: cluster:scheduler_scheduling_algorithm_latency_seconds:quantile
expr: histogram_quantile(0.9, sum(scheduler_scheduling_algorithm_latency_microseconds_bucket)
BY (le, cluster)) / 1e+06
labels:
quantile: "0.9"
- record: cluster:scheduler_scheduling_algorithm_latency_seconds:quantile
expr: histogram_quantile(0.5, sum(scheduler_scheduling_algorithm_latency_microseconds_bucket)
BY (le, cluster)) / 1e+06
labels:
quantile: "0.5"
- record: cluster:scheduler_binding_latency_seconds:quantile
expr: histogram_quantile(0.99, sum(scheduler_binding_latency_microseconds_bucket)
BY (le, cluster)) / 1e+06
labels:
quantile: "0.99"
- record: cluster:scheduler_binding_latency_seconds:quantile
expr: histogram_quantile(0.9, sum(scheduler_binding_latency_microseconds_bucket)
BY (le, cluster)) / 1e+06
labels:
quantile: "0.9"
- record: cluster:scheduler_binding_latency_seconds:quantile
expr: histogram_quantile(0.5, sum(scheduler_binding_latency_microseconds_bucket)
BY (le, cluster)) / 1e+06
labels:
quantile: "0.5"
- alert: K8SSchedulerDown
expr: absent(up{job="kube-scheduler"} == 1)
for: 5m
labels:
severity: critical
annotations:
description: There is no running K8S scheduler. New pods are not being assigned
to nodes.
runbook: https://coreos.com/tectonic/docs/latest/troubleshooting/controller-recovery.html#recovering-a-scheduler
summary: Scheduler is down
vim charts/exporter-kube-state/templates/kube-state-metrics.rules.yaml
{{ define "kube-state-metrics.rules.yaml.tpl" }}
groups:
- name: kube-state-metrics.rules
rules:
- alert: DeploymentGenerationMismatch
expr: kube_deployment_status_observed_generation != kube_deployment_metadata_generation
for: 15m
labels:
severity: warning
annotations:
description: Observed deployment generation does not match expected one for
deployment {{`{{$labels.namespaces}}`}}/{{`{{$labels.deployment}}`}}
summary: Deployment is outdated
- alert: DeploymentReplicasNotUpdated
expr: ((kube_deployment_status_replicas_updated != kube_deployment_spec_replicas)
or (kube_deployment_status_replicas_available != kube_deployment_spec_replicas))
unless (kube_deployment_spec_paused == 1)
for: 15m
labels:
severity: warning
annotations:
description: Replicas are not updated and available for deployment {{`{{$labels.namespaces}}`}}/{{`{{$labels.deployment}}`}}
summary: Deployment replicas are outdated
- alert: DaemonSetRolloutStuck
expr: kube_daemonset_status_number_ready / kube_daemonset_status_desired_number_scheduled
* 100 < 100
for: 15m
labels:
severity: warning
annotations:
description: Only {{`{{$value}}`}}% of desired pods scheduled and ready for daemon
set {{`{{$labels.namespaces}}`}}/{{`{{$labels.daemonset}}`}}
summary: DaemonSet is missing pods
- alert: K8SDaemonSetsNotScheduled
expr: kube_daemonset_status_desired_number_scheduled - kube_daemonset_status_current_number_scheduled
> 0
for: 10m
labels:
severity: warning
annotations:
description: A number of daemonsets are not scheduled.
summary: Daemonsets are not scheduled correctly
- alert: DaemonSetsMissScheduled
expr: kube_daemonset_status_number_misscheduled > 0
for: 10m
labels:
severity: warning
annotations:
description: A number of daemonsets are running where they are not supposed
to run.
summary: Daemonsets are not scheduled correctly
- alert: PodFrequentlyRestarting
expr: increase(kube_pod_container_status_restarts_total[1h]) > 5
for: 10m
labels:
severity: warning
annotations:
description: Pod {{`{{$labels.namespaces}}`}}/{{`{{$labels.pod}}`}} was restarted {{`{{$value}}`}}
times within the last hour
summary: Pod is restarting frequently
- alert: KeyServicePodRestarting
expr: increase(kube_pod_container_status_restarts_total{namespace="default",pod=~"ffdl-lcm-.*|ffdl-trainer-.*|ffdl-restapi-.*|ffdl-ui-.*|cc-deployment-.*|etcd.*|mongo.*|storage.*|alertmanager-.*|prometheus-.*|pushgateway-.*"}[2m]) > 0
for: 5m
labels:
severity: minor
annotations:
description: A key service restarted at least once within the last 2 minutes
summary: A key service {{`{{$labels.pod}}`}} restarted
{{ end }}
vim charts/exporter-node/templates/node.rules.yaml
{{ define "node.rules.yaml.tpl" }}
groups:
- name: node.rules
rules:
- record: instance:node_cpu:rate:sum
expr: sum(rate(node_cpu{mode!="idle",mode!="iowait",mode!~"^(?:guest.*)$"}[3m]))
BY (instance)
- record: instance:node_filesystem_usage:sum
expr: sum((node_filesystem_size{mountpoint="/"} - node_filesystem_free{mountpoint="/"}))
BY (instance)
- record: instance:node_network_receive_bytes:rate:sum
expr: sum(rate(node_network_receive_bytes[3m])) BY (instance)
- record: instance:node_network_transmit_bytes:rate:sum
expr: sum(rate(node_network_transmit_bytes[3m])) BY (instance)
- record: instance:node_cpu:ratio
expr: sum(rate(node_cpu{mode!="idle"}[5m])) WITHOUT (cpu, mode) / ON(instance)
GROUP_LEFT() count(sum(node_cpu) BY (instance, cpu)) BY (instance)
- record: cluster:node_cpu:sum_rate5m
expr: sum(rate(node_cpu{mode!="idle"}[5m]))
- record: cluster:node_cpu:ratio
expr: cluster:node_cpu:rate5m / count(sum(node_cpu) BY (instance, cpu))
- alert: CPUHighUsage15Mins
expr: 100 - (avg by (instance) (irate(node_cpu{mode="idle"}[15m])) * 100) > 80
for: 5m
labels:
severity: warning
annotations:
description: CPU usage on Node {{`{{$labels.instance}}`}} is more than 80% for 15 mins
summary: Node is high on CPU usage for 15 mins
- alert: CPUVeryHighUsage15Mins
expr: 100 - (avg by (instance) (irate(node_cpu{mode="idle"}[15m])) * 100) > 95
for: 5m
labels:
severity: minor
annotations:
description: CPU usage on Node {{`{{$labels.instance}}`}} is more than 95% for 15 mins
summary: Node is very high on CPU usage for 15mins
- alert: GPUHighTemp
expr: dcgm_gpu_temp > 80
for: 5m
labels:
severity: info
annotations:
description: GPU {{`{{$labels.gpu}}`}} on Node {{`{{$labels.instance}}`}} is {{`{{$value}}`}} degree
summary: Node GPU hits high temp
- alert: GPUVeryHighTemp
expr: dcgm_gpu_temp > 90
for: 5m
labels:
severity: warning
annotations:
description: GPU {{`{{$labels.gpu}}`}} on Node {{`{{$labels.instance}}`}} is {{`{{$value}}`}} degree
summary: Node GPU hits very high temp
- alert: GPUExtremelyHighTemp
expr: dcgm_gpu_temp > 95
for: 5m
labels:
severity: minor
annotations:
description: GPU {{`{{$labels.gpu}}`}} on Node {{`{{$labels.instance}}`}} is {{`{{$value}}`}} degree
summary: Node GPU hits extremely high temp
- alert: GPUAvgVeryHighTemp5Mins
expr: avg_over_time(dcgm_gpu_temp[5m]) > 90
for: 5m
labels:
severity: warning
annotations:
description: GPU {{`{{$labels.gpu}}`}} on Node {{`{{$labels.instance}}`}} is average {{`{{$value}}`}} degree for 5 mins
summary: Node is very high on GPU temp for 5mins
- alert: GPUAvgVeryHighTemp15Mins
expr: avg_over_time(dcgm_gpu_temp[15m]) > 90
for: 5m
labels:
severity: minor
annotations:
description: GPU {{`{{$labels.gpu}}`}} on Node {{`{{$labels.instance}}`}} is average {{`{{$value}}`}} degree for 15 mins
summary: Node is very high on GPU temp for 15mins
- alert: GPUAvgHighPowerUsage15Mins
expr: avg_over_time(dcgm_power_usage[15m]) > 300
for: 5m
labels:
severity: warning
annotations:
description: GPU {{`{{$labels.gpu}}`}} on Node {{`{{$labels.instance}}`}} consumes average {{`{{$value}}`}} W for 15 mins
summary: Node is very high on GPU power usage for 15mins
- alert: GPUAvgVeryHighPowerUsage15Mins
expr: avg_over_time(dcgm_power_usage[15m]) > 500
for: 5m
labels:
severity: minor
annotations:
description: GPU {{`{{$labels.gpu}}`}} on Node {{`{{$labels.instance}}`}} consumes average {{`{{$value}}`}} W for 15 mins
summary: Node is very high on GPU power usage for 15mins
- alert: GPUAvgHighGPUUsage15Mins
expr: avg_over_time(dcgm_gpu_utilization[15m]) > 0.8
for: 5m
labels:
severity: warning
annotations:
description: GPU {{`{{$labels.gpu}}`}} usage on Node {{`{{$labels.instance}}`}} is average {{`{{$value}}`}} for 15 mins
summary: Node is high on GPU usage for 15mins
- alert: GPUAvgVeryHighGPUUsage15Mins
expr: avg_over_time(dcgm_gpu_utilization[15m]) > 0.95
for: 5m
labels:
severity: minor
annotations:
description: GPU {{`{{$labels.gpu}}`}} usage on Node {{`{{$labels.instance}}`}} is average {{`{{$value}}`}} for 15 mins
summary: Node is very high on GPU usage for 15mins
- alert: MemUsageHigh
expr: 100 - ((node_memory_MemFree+node_memory_Cached+node_memory_Buffers)/node_memory_MemTotal) * 100 > 90
for: 5m
labels:
severity: warning
annotations:
description: Memory usage on Node {{`{{$labels.instance}}`}} is {{`{{$value}}`}}
summary: Node is high on memory usage
- alert: MemUsageVeryHigh
expr: 100 - ((node_memory_MemFree+node_memory_Cached+node_memory_Buffers)/node_memory_MemTotal) * 100 > 95
for: 5m
labels:
severity: minor
annotations:
description: Memory usage on Node {{`{{$labels.instance}}`}} is {{`{{$value}}`}}
summary: Node is very high on memory usage
- alert: MemAvgUsageHigh5Mins
expr: (1 - ((avg_over_time(node_memory_MemFree[5m]) + avg_over_time(node_memory_Cached[5m]) + avg_over_time(node_memory_Buffers[5m])) / avg_over_time(node_memory_MemTotal[5m]))) * 100 > 90
for: 5m
labels:
severity: warning
annotations:
description: Average Memory usage on Node {{`{{$labels.instance}}`}} is {{`{{$value}}`}} for 5 mins
summary: Node is high on average memory usage for 5mins
- alert: MemUsageHigh15Mins
expr: (1 - ((avg_over_time(node_memory_MemFree[15m]) + avg_over_time(node_memory_Cached[15m]) + avg_over_time(node_memory_Buffers[15m])) / avg_over_time(node_memory_MemTotal[15m]))) * 100 > 90
for: 5m
labels:
severity: minor
annotations:
description: Average Memory usage on Node {{`{{$labels.instance}}`}} is {{`{{$value}}`}} for 15 mins
summary: Node is high on average memory usage for 15mins
- alert: NodeUnschedulable
expr: sum(kube_node_spec_unschedulable) > 0
for: 5m
labels:
severity: minor
annotations:
description: a node is unschedulable for 5 minutes
summary: Node is unschedulable
- alert: ManyNodesUnschedulable
expr: sum(kube_node_spec_unschedulable) / count(kube_node_created) * 100 > 50
for: 5m
labels:
severity: major
annotations:
description: more than 50% of nodes are unschedulable for 5 minutes
summary: Many Nodes are unschedulable
- alert: NodeExporterDown
expr: absent(up{job="node-exporter"} == 1)
for: 10m
labels:
severity: minor
annotations:
description: Prometheus could not scrape a node-exporter for more than 10m,
or node-exporters have disappeared from discovery
summary: Prometheus could not scrape a node-exporter
- alert: NodeDiskRunningFull
expr: predict_linear(node_filesystem_free[6h], 3600 * 24) < 0
for: 30m
labels:
severity: warning
annotations:
description: device {{`{{$labels.device}}`}} on node {{`{{$labels.instance}}`}} is running full within the next 24 hours (mounted at {{`{{$labels.mountpoint}}`}})
summary: Node disk is running full within 24 hours
- alert: NodeDiskRunningFull
expr: predict_linear(node_filesystem_free[30m], 3600 * 2) < 0
for: 10m
labels:
severity: major
annotations:
description: device {{`{{$labels.device}}`}} on node {{`{{$labels.instance}}`}} is running full within the next 2 hours (mounted at {{`{{$labels.mountpoint}}`}})
summary: Node disk is running full within 2 hours
{{ end }}